A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
https://arxiv.org/abs/2403.14438
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) syst…